Load Library Package

“Use the Tidyverse, Luke” – O-W.Kenobi

library(tidyverse)
library(skimr)
library(plotly)

Get Data

Crossref data used from the Setup to the LC OpenRefine Workshop

crossref_data <- read_csv("https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))

Take a quick look at the data

glimpse(crossref_data)
Observations: 1,001
Variables: 11
$ Title     <chr> "The Fisher Thermodynamics of Quasi-Probabilities", "Aflatoxin Contamination of the...
$ Authors   <chr> "Flavia Pennini|Angelo Plastino", "Naveed Aslam|Peter C. Wynn", "Rafael R. C. Cuadr...
$ DOI       <chr> "10.3390/e17127853", "10.3390/agriculture5041172", "10.3390/ijms161226101", "10.339...
$ URL       <chr> "https://doaj.org/article/b75e8d5cca3f46cbbd63e91be5b32412", "https://doaj.org/arti...
$ Date      <date> 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11, 2015-01-11...
$ Language  <chr> "English", "English", "English", "EN", "EN", "English", "English", "English", "Engl...
$ Subjects  <chr> "Fisher information|quasi-probabilities|complementarity|Physics|QC1-999|Science|Q",...
$ ISSNs     <chr> "1099-4300", "2077-0472", "1422-0067", "2304-6740", "2306-5338", "1420-3049", "2073...
$ Publisher <chr> "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI AG", "MDPI ...
$ Citation  <chr> "Entropy, Vol 17, Iss 12, Pp 7848-7858 (2015)", "Agriculture (Basel), Vol 5, Iss 4,...
$ Licence   <chr> "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "CC BY", "C...
crossref_data

skimr

Skimr is a easy way to have a quick look at the variables in the data frame. In this case the data are mostly character string data. With numeric data skimr will produce a thumbnail histogram (sparkline )

skim(crossref_data)
Skim summary statistics
 n obs: 1001 
 n variables: 11 

-- Variable type:character -----------------------------------------------------
  variable missing complete    n min max empty n_unique
   Authors       0     1001 1001   7 291     0      883
  Citation       0     1001 1001  39 104     0     1000
       DOI      23      978 1001  16  29     0      977
     ISSNs       0     1001 1001   9  19     0       51
  Language      15      986 1001   2   7     0        4
   Licence       6      995 1001   5  11     0        3
 Publisher       0     1001 1001   7  47     0        6
  Subjects       0     1001 1001  17 337     0      988
     Title       0     1001 1001  18 318     0     1000
       URL       0     1001 1001  57  57     0     1000

-- Variable type:Date ----------------------------------------------------------
 variable missing complete    n        min        max     median n_unique
     Date       0     1001 1001 2015-01-01 2015-01-12 2015-01-07       12

Faceting

Two methods to generate a quick table of the languages represented in the dataframe: count() and forcats::fct_count. Since these data are primarily character, it’s helpful to learn about factor data and the forcats package. These two tables are the same. It looks like the data are published in English (spelled two different ways), FRench and Spanish.

crossref_data %>% 
  count(Language)

fct_count(crossref_data$Language, sort = TRUE)

This time, facet on the governing license. All but six articles are covered by a createive commons license.

crossref_data %>% 
  count(Licence)

Facet on the publisher. Sort in descending order.

crossref_data %>% 
  count(Publisher, sort = TRUE)

Facet by authors, and sort by the most prolific. This field appears to be a multi-valued field that is pipe | separated. How do we count and visualize how many articles have multiple authors?

crossref_data %>% 
  count(Authors, sort = TRUE) 

The above table is not very useful (unless tracking publishing teams that are always expressed identically.) Let’s exploring some methods to generate a count of the pipe character separating each author in a single author field. The stringr::str_count() function is a great way to calculate the number of delimiters in each author field.

Note that counting a pipe character | requires using a Regular Expression, or regex. Anyone manipulating string characters with computers will be far more capable after spending some time learning about regular expressions. In this case the we’re looking for a pipe character |. The special trick, here, in understanding regex is to know that a pipe character has special meaning. Therefore we have to escape, or make it know that we want the literal pipe character and not the special meaning pipe character. To escape a character in regex one uses a backslash \. But the weird part is that, in R, one has to escape the the escape character: \\| means look for a literal |.

Below we count the number of pipe characters in each row of the Author field. Using the head function we only display the first six values (rows) in the Author column.

str_count(crossref_data$Authors, "\\|") %>% head()
[1] 1 1 2 3 2 3

Transform Data

Use dplyr::mutate to generate a new field that calculates how many authors each observation contains.

crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  select(Authors, multi_authorship)

Visualize

Authors

Generate a histogram distribution of the multiple authorship variable.

crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  select(multi_authorship, Authors) %>% 
  ggplot() +
  aes(multi_authorship) +
  geom_histogram(binwidth = 1)

This time generate as a bar graph and sort by the most frequent representation. Articles with five authors is the most frequent representation in the dataset.

auth_count <- crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  mutate(multi_authorship = as.character(multi_authorship)) %>% 
  select(multi_authorship, Authors)

ggplot(auth_count) +
  aes(fct_infreq(multi_authorship)) +
  geom_bar() +
  ggtitle("Multiple Authorship")

Explore Subject Headings

Visualize the frequency of multiple subject headings, just as with authors (A bar graph and a histogram)

crossref_data %>% 
  mutate(SH_count = str_count(Subjects, "\\|") + 1) %>% 
  mutate(SH_count = as.character(SH_count)) %>% 
  ggplot() +
  aes(fct_infreq(SH_count)) +
  geom_bar()


crossref_data %>% 
  mutate(SH_count = str_count(Subjects, "\\|") + 1) %>% 
  ggplot() +
  aes(SH_count) +
  geom_histogram(binwidth = 1) 

Data Transformations

Using dplyr, mutate a new variable and transform the data so that ‘EN’ and ‘English’ are the same. Transform ‘ES’ to “Spanish”, and ‘FR’ to “French”.

dplyr::case_when() is one specialized way to perform an if_else transformation.

crossref_data %>% 
  count(Language)

Since EN and English are synonymous, let’s combine them into a single value. case_when is a great function for collapsing values.

crossref_data <- crossref_data %>% 
  mutate(Language = case_when(
    Language == "EN" ~ "English",
    Language == "ES" ~ "Spanish",
    Language == "FR" ~ "French"
  )) 

Visualize the Languages.

Stacked Bar graph shows frequency by Language. Each stack of a bar distinguishes the publishers. English Language is huge and somewhat over-powers the reset of the graph. Make a second graph (below) to drill down on the lesser represented languages.

published_languages_bargraph <- crossref_data %>% 
  ggplot() +
  aes(fct_infreq(Language), fill = Publisher) +
  geom_bar()

published_languages_bargraph

Filter the data to show only the “NA”, “French”, and “Spanish”.

crossref_data %>% 
  filter(is.na(Language) | Language == "French" | Language == "Spanish") %>% 
  ggplot() +
  aes(fct_infreq(Language), fill = Publisher) +
  geom_bar() +
  labs(title = "Published Languages",
       subtitle = "NA or Non-English",
       caption = "Data Source: Crossref.org")

Time Series

published_over_time <- crossref_data %>% 
  count(Date) %>% 
  ggplot(aes(Date, n)) +
  geom_point() +
  geom_line() +
  labs("Publishing Frequency by Day", 
       subtitle = "January, 2015")
  
published_over_time

Interactive

Using Plottly’s ggplotly function, generate visualizations that are available for interactive mousing (i.e. subsetting and exploring). Gadgets such as sliders, drop-down menus, selection boxes and radio buttons are available and especially useful when combining library(crosstalk) with library(flexdashboards) as seen in the opening tab of this demonstration dashboard

ggplotly(published_languages_bargraph)
ggplotly(published_over_time)
---
title: "first look at the dataset"
output:
  pdf_document: default
  html_notebook: default
  word_document: default
---

- library data
- automation

## Load Library Package

"Use the [Tidyverse](https://www.tidyverse.org/), Luke" -- O-W.Kenobi 

```{r}
library(tidyverse)
library(skimr)
library(plotly)
```

## Get Data

Crossref data used from the **Setup** to the [LC OpenRefine Workshop](https://librarycarpentry.org/lc-open-refine/setup.html)

```{r}
crossref_data <- read_csv("https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
```
Take a quick look at the data

```{r}
glimpse(crossref_data)
crossref_data
```

## skimr

Skimr is a easy way to have a quick look at the variables in the data frame.  In this case the data are mostly character string data.  With numeric data skimr will produce a thumbnail histogram (sparkline )

```{r}
skim(crossref_data)
```


## Faceting

Two methods to generate a quick table of the languages represented in the dataframe:  `count()` and `forcats::fct_count`.  Since these data are primarily character, it's helpful to learn about factor data and the forcats package. These two tables are the same.  It looks like the data are published in English (spelled two different ways), FRench and Spanish.

```{r}
crossref_data %>% 
  count(Language)

fct_count(crossref_data$Language, sort = TRUE)
```

This time, facet on the governing license.  All but six articles are covered by a [createive commons](https://creativecommons.org/) license.

```{r}
crossref_data %>% 
  count(Licence)
```

Facet on the publisher.  Sort in descending order.

```{r}
crossref_data %>% 
  count(Publisher, sort = TRUE)
```

Facet by authors, and sort by the most prolific.  This field appears to be a multi-valued field that is pipe `|` separated.  How do we count and visualize how many articles have multiple authors?

```{r}
crossref_data %>% 
  count(Authors, sort = TRUE) 
```

The above table is not very useful (unless tracking publishing teams that are always expressed identically.)  Let's exploring some methods to generate a count of the pipe character separating each author in a single author field.  The `stringr::str_count()` function is a great way to calculate the number of delimiters in each author field.  

Note that counting a pipe character `|` requires using a Regular Expression, or [regex](https://en.wikipedia.org/wiki/Regular_expression).  Anyone manipulating string characters with computers will be far more capable after spending some time learning about regular expressions.  In this case the we're looking for a pipe character `|`.  The special trick, here, in understanding regex is to know that a pipe character has special meaning.  Therefore we have to escape, or make it know that we want the literal pipe character and not the special meaning pipe character.   To escape a character in regex one uses a backslash `\`.  But the weird part is that, in R, one has to escape the the escape character:  `\\|` means look for a literal `|`.

Below we count the number of pipe characters in each row of the Author field.  Using the `head` function we only display the first six values (rows) in the Author column.

```{r}
str_count(crossref_data$Authors, "\\|") %>% head()
```

## Transform Data

Use `dplyr::mutate` to generate a new field that calculates how many authors each observation contains.

```{r}
crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  select(Authors, multi_authorship)
```

## Visualize

### Authors

Generate a histogram distribution of the multiple authorship variable.

```{r}
crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  select(multi_authorship, Authors) %>% 
  ggplot() +
  aes(multi_authorship) +
  geom_histogram(binwidth = 1)
```

This time generate as a bar graph and sort by the most frequent representation.  Articles with five authors is the most frequent representation in the dataset. 

```{r}
auth_count <- crossref_data %>% 
  select(Authors) %>% 
  mutate(multi_authorship = str_count(Authors, "\\|") + 1) %>% 
  mutate(multi_authorship = as.character(multi_authorship)) %>% 
  select(multi_authorship, Authors)

ggplot(auth_count) +
  aes(fct_infreq(multi_authorship)) +
  geom_bar() +
  ggtitle("Multiple Authorship")
```

### Explore Subject Headings

Visualize the frequency of multiple subject headings, just as with authors (A bar graph and a histogram)

```{r}
crossref_data %>% 
  mutate(SH_count = str_count(Subjects, "\\|") + 1) %>% 
  mutate(SH_count = as.character(SH_count)) %>% 
  ggplot() +
  aes(fct_infreq(SH_count)) +
  geom_bar()

crossref_data %>% 
  mutate(SH_count = str_count(Subjects, "\\|") + 1) %>% 
  ggplot() +
  aes(SH_count) +
  geom_histogram(binwidth = 1) 
```


## Data Transformations

Using dplyr, mutate a new variable and transform the data so that 'EN' and 'English' are the same.  Transform 'ES' to "Spanish", and 'FR' to "French".  

`dplyr::case_when()` is one specialized way to perform an `if_else` transformation.

```{r}
crossref_data %>% 
  count(Language)
```

Since `EN` and `English` are synonymous, let's combine them into a single value.  `case_when` is a great function for collapsing values.

```{r}
crossref_data <- crossref_data %>% 
  mutate(Language = case_when(
    Language == "EN" ~ "English",
    Language == "ES" ~ "Spanish",
    Language == "FR" ~ "French"
  )) 
```

### Visualize the Languages. 

Stacked Bar graph shows frequency by Language. Each stack of a bar distinguishes the publishers. English Language is huge and somewhat over-powers the reset of the graph.  Make a second graph (below) to drill down on the lesser represented languages.

```{r}
published_languages_bargraph <- crossref_data %>% 
  ggplot() +
  aes(fct_infreq(Language), fill = Publisher) +
  geom_bar()

published_languages_bargraph
```

Filter the data to show only the "NA", "French", and "Spanish".

```{r}
crossref_data %>% 
  filter(is.na(Language) | Language == "French" | Language == "Spanish") %>% 
  ggplot() +
  aes(fct_infreq(Language), fill = Publisher) +
  geom_bar() +
  labs(title = "Published Languages",
       subtitle = "NA or Non-English",
       caption = "Data Source: Crossref.org")
```

## Time Series

```{r}
published_over_time <- crossref_data %>% 
  count(Date) %>% 
  ggplot(aes(Date, n)) +
  geom_point() +
  geom_line() +
  labs("Publishing Frequency by Day", 
       subtitle = "January, 2015")
  
published_over_time
```
## Interactive

Using Plottly's `ggplotly` function, generate visualizations that are available for interactive mousing (i.e. subsetting and exploring).  Gadgets such as sliders, drop-down menus, selection boxes and radio buttons are available and especially useful when combining library(crosstalk) with library(flexdashboards) [as seen in the opening tab of this demonstration dashboard](https://rfun-flexdashboards.netlify.com/)


```{r}
ggplotly(published_languages_bargraph)
```


```{r}
ggplotly(published_over_time)
```

